A Type System and Analysis for the Automatic Extraction 
and Enforcement of Design Information 


Patrick Lam 
Laboratory for Computer Science 
Massachusetts Institute of Technology 
Cambridge, MA 02139 


plam@mit.edu 


ABSTRACT 


We present a new type system and associated type checker, 
analysis, and model extraction algorithms for automatically 
extracting models that capture aspects of the design of the 
program. Our type system enables the developer to place a 
token on each object; this token serves as the object’s rep- 
resentative during the analysis and model extraction. The 
polymorphism in our type system enables the use of general- 
purpose classes whose instances may serve different purposes 
in the computation; programmers may also hide the details 
of internal data structures by placing the same token on all 
of the objects in these data structures. 

Our combined type system and analysis provide the model 
extraction algorithms with sound heap aliasing information. 
Our algorithms can therefore extract both structural models 
that characterize object referencing relationships and behav- 
ioral models that capture indirect interactions mediated by 
objects in the heap. Previous approaches, in contrast, lack 
aliasing information and have therefore focused on 
control-flow interactions that take place at procedure call 
boundaries. We have implemented our type checker, analysis, and 
model extraction algorithms and used them to produce de- 
sign models. Our experience indicates that it is straight- 
forward to produce the token annotations and that the ex- 
tracted models provide useful insight into the structure and 
behavior of the program. 


1. INTRODUCTION 


Design abstractions such as object models [11] and mod- 
ule dependency diagrams are a central feature of many soft- 
ware development processes. In this capacity they provide 
a way to quickly and easily explore design alternatives and 
give the members of the design team a common and effec- 
tive language for communicating important aspects of the 
design. 

In principle, the design abstractions should remain a pri- 
mary source of information about the program for its entire 
lifetime. But the standard practice is for programmers to 
manually implement the design once it has been finalized, 
raising the possibility of the implementation diverging from 
the design. This divergence becomes ever more likely over 
the lifetime of the program, limiting the credibility of the 
original design and therefore its utility as a source of 
information about the program. In most cases, the design 
is eventually discarded and the code becomes the primary 
source of information about the program. 

This paper presents a new type system and an associated 
analysis that together support the automatic extraction of 
design-level information from the source code. The goal is 
to establish a guaranteed connection between the program 
and its design, restore the credibility of the design as a re- 
liable source of information about the program, and enable 
developers to use design abstractions effectively throughout 
the entire lifetime of the program. 

We focus on abstractions that involve the structure of the 
heap and the information flow (or lack of such flow) between 
different subsystems. One particularly novel aspect of our 
technique is that it accurately captures even indirect interac- 
tions mediated by objects in the heap. Existing approaches, 
in contrast, focus only on the direct interactions that take 
place at procedure or method calls. 

The key idea behind our approach is to allow the developer 
to use the type system to place a token (chosen from a finite 
set of tokens fixed at program analysis time) on each object 
in the program; this token serves as the object’s representa- 
tive during the analysis that extracts the design information 
from the program. This approach addresses several common 
problems that complicate the effective automatic extraction 
of design information: 


• Multiple Design Elements, Single Code Element: 
Well-structured programs factor common behavior and 
structure into a single, general-purpose code element 
(for example, a container class or object factory). 
Different instantiations of such an element often have 
distinct conceptual purposes in the computation and should 
therefore correspond to different elements in the design. 
But standard analysis approaches treat each code 
element as a unit, conflating the attributes of its 
different instantiations and failing to capture important 
design-level distinctions. 


The polymorphism in our type system eliminates this 
problem. It allows the developer to place different to- 
kens on different instantiations of the same class, en- 
abling the analysis to separate objects with different 
conceptual purposes even if the objects happen to be 
instances of the same general-purpose class. 


• Single Design Element, Multiple Code Elements: 
Because the design captures aspects of the computation 
at a higher level of abstraction than the code, 
multiple code elements are often required to implement 
a single design element. For example, a primary 
object may maintain complex internal data structures 
that the design abstracts as conceptually part of the 
object. Any approach that fails to abstract these 
internal data structures will deliver an overly detailed 
model that obscures key aspects of the design. 


Our type system addresses this problem by allowing 
the developer to place the same token on both the pri- 
mary object and all of the objects that implement its 
internal data structures. The analysis then treats the 
entire collection of objects as a unit and appropriately 
coalesces the combined information from all of the ob- 
jects into a single design element. 


Consider, for example, a set object with an internal 
linked list of references to items in the set. Our sys- 
tem allows the developer to place the same token on 
both the set object and all of the linked list objects, 
with a separate token on the items that the list nodes 
reference. In the extracted models, the set and all of 
its internal linked list nodes comprise a single abstrac- 
tion. Because the items in the set have a different 
token, they correspond to a separate abstraction. 


• Aliasing: To accurately extract structural information 
(for example, referencing relationships between 
objects) and behavioral information (for example, how 
information flows between subsystems), the analysis 
needs to have information about the aliasing relation- 
ships in the heap. An expensive whole-program pointer 
analysis is the standard way to obtain this information. 
Pointer analyses typically use the creation site of each 
object to represent the object during the analysis, in 
which case the analysis results conflate all objects allo- 
cated at the same site and fail to appropriately coalesce 
internal objects. 


In our type system, the type of each object completely 
characterizes the referencing relationships (at the gran- 
ularity of tokens) in the part of the heap reachable 
from that object. Instead of processing all of the load 
and store statements to construct a model of the heap, 
our analysis can simply propagate token information 
across procedure boundaries to substitute out the to- 
ken variables in the polymorphic types. The resulting 
ground types provide the required aliasing informa- 
tion. 


Our analysis produces the following models: 


• Object Models: An object model identifies the kinds 
of objects in the heap and characterizes the relationships 
between these different kinds of objects [11]. We 
model the objects and relationships at the granularity 
of tokens. Specifically, there is a node in the model 
for each token. There is a labelled edge between two 
tokens if the heap may contain two objects represented 
by the tokens and one object may contain a reference 
to the other. The label identifies the field containing 
the reference. There is also an edge from t1 to t2 if 
a method, executing such that this has token t1, can 
create a reference to an object of token t2. 


Building the model at the granularity of the tokens 
separates conceptually distinct instances of the same 
class and enables the model to appropriately capture 
the different structural relationships associated with 
these different instances. The standard approach, in 
contrast, operates at the granularity of classes and fails to 
capture these distinctions [17]. 


• Subsystem Access Models: These models characterize 
how subsystems access objects. Each of these 
models is a bipartite graph. There is a node for each 
token and a node for each subsystem, with an edge 
from a subsystem to a token if the subsystem may ac- 
cess an object represented by the token. 


• Interaction Models: Interaction models characterize 
interactions between subsystems at the granularity 
of tokens. We support two kinds of models: 


- Call/Return Interaction Model: This model 
characterizes the direct interactions that take place 
at method calls and returns. The nodes in the 
call/return model are subsystems. There is a solid 
directed edge from subsystem s1 to s2 if a method 
in s1 invokes a method in s2. The edge is labelled 
with the tokens that represent the objects passed 
as parameters in any s1 method calling s2. There 
is a dashed directed edge from s2 to s1 if some 
method in the s2 subsystem returns a result to a 
method in s1. The edge is labelled with all tokens 
representing objects returned from s2 to s1. 


- Heap Interaction Model: This model characterizes 
the indirect interactions that take place at 
reads and writes to and from objects in the heap. 
The nodes in this model are tokens. There is a 
solid directed edge between two tokens if a sub- 
system writes a reference to an object represented 
by the first token into an object represented by 
the second token. The label on the edge identifies 
the subsystem that performed the write. There 
is a dashed directed edge between two tokens if a 
subsystem reads a field in an object represented 
by the first token and obtains a reference to an 
object represented by the second token. The la- 
bel on the edge is the subsystem that performed 
the read. 


This model smoothly generalizes to support higher- 
level actions (such as insertions and removals) on 
abstract data types (such as hashtables and lists). 


Together, these models enable the developer to trace 
all of the dependences between and flow of informa- 
tion through the subsystems in the program. They 
also support useful projection operations — to focus 
on a particular aspect of the interactions, the devel- 
oper selects the relevant subsystems or tokens, then 
hides those parts of the model that do not involve these 
subsystems or tokens. The resulting projected models 
clearly expose the properties of interest. 


Our enhanced subsystem models succinctly capture all 
of the information in standard subsystem interaction 
models (which focus on aspects of the control flow; in 
particular, on how methods in one subsystem invoke 
methods in other subsystems). But the availability of 
a sound, relevant model of the heap also enables the 
analysis to characterize not only the control flow but 
also the information flow that occurs at method calls. 
Perhaps more significantly, it can also characterize how 
subsystems access data and capture indirect subsystem 
interactions mediated by objects in the heap. 
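
The projection operation described above amounts to filtering the edges of a model. The following is a minimal sketch under stated assumptions: the names ModelProjection, Edge, and project are hypothetical and are not taken from the implemented system, and a model is represented simply as a list of labelled edges whose endpoints and labels are token or subsystem names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: ModelProjection and Edge are illustrative names,
// not part of the tool described in the paper.
class ModelProjection {
    // A labelled, directed model edge; endpoints and label are the names
    // of tokens or subsystems, depending on the model.
    static final class Edge {
        final String from;
        final String to;
        final String label;

        Edge(String from, String to, String label) {
            this.from = from;
            this.to = to;
            this.label = label;
        }

        @Override
        public String toString() {
            return from + " -[" + label + "]-> " + to;
        }
    }

    // Keeps the edges that involve a selected name as an endpoint or as
    // the label, and hides every other edge.
    static List<Edge> project(List<Edge> model, Set<String> selected) {
        List<Edge> visible = new ArrayList<>();
        for (Edge e : model) {
            if (selected.contains(e.from) || selected.contains(e.to)
                    || selected.contains(e.label)) {
                visible.add(e);
            }
        }
        return visible;
    }
}
```

Calling project with a set containing one subsystem name keeps only the edges in which that subsystem takes part. A real tool might instead hide nodes and recompute the layout; this sketch only shows the selection step.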


This paper makes the following contributions: 


• Polymorphic Token Type System: It presents 
a polymorphic type system that allows developers to 
place a token on each object. This type system is 
structured as an extension to Java, and includes a type 
checking algorithm that determines if the type decla- 
rations are correct. 


• Analysis and Model Extraction Algorithms: It 
presents an analysis algorithm and model extraction 
algorithms that, together, use the type system to ex- 
tract models that capture aspects of the design of the 
program. This extraction-based approach ensures that 
the models correctly reflect the design of the program. 
In contrast with many previous approaches, the pres- 
ence of sound heap aliasing information enables the 
extraction of both structural models that characterize 
object referencing relationships and behavioral models 
that capture indirect interactions mediated by objects 
in the heap. 


• Experience: We have implemented our type system, 
analysis, and model extraction algorithms. We have 
used these algorithms to produce design models. Our 
experience indicates that it is straightforward to pro- 
duce the token annotations and that the extracted 
models provide useful insight into the structure and 
behavior of the program. 


2. EXAMPLE 


We next present an example that illustrates how our anal- 
ysis produces the interaction models. Figure 1 presents a 
program in which a driver coordinates the activities of a 
producer and a consumer. The producer and consumer in- 
teract via a stack of objects; the driver creates the stack, 
then repeatedly invokes the producer (which pushes some 
Integer items on to the stack) and the consumer (which 
pops the Integer items off of the stack). There are two kinds 
of interactions: call/return interactions in which the stack 
flows between the driver, the producer, and the consumer, 
and heap interactions in which the produced items flow from 
the producer through the stack to the consumer. We next 
discuss how our analysis produces models that present in- 
formation about this program. 


2.1 Subsystems 


Our analysis describes the behavior of the system at the 
granularity of subsystems. Each subsystem corresponds to 
a set of method invocations that serve the same concep- 
tual purpose in the computation. Our example contains 
four subsystems: the MAIN subsystem that executes the main 
method, the EP (Event Producer) subsystem that produces 
the data, the EC (Event Consumer) subsystem that consumes 
the data, and the ED (Event Driver) subsystem that 
invokes the EP and EC subsystems.¹ 


¹In practice, we would expect the subsystems to be much 
larger. We adopt this fine subsystem granularity in our example 
for expository purposes. 


token P, C, D, PCS, PCI; 
subsys EP, EC, ED; 

class Int<i> { 
    int v; 
    Int(int v) { this.v = v; } 
} 

class Node<s,i> { 
    Node<s,i> next; 
    Int<i> data; 
} 

class Stack<s,i> { 
    private Node<s,i> first; 
    public void push(Int<i> k) { 
        Node<s,i> n = new Node<s,i>(); 
        n.data = k; 
        n.next = first; 
        first = n; 
    } 
    public Int<i> pop() { 
        Int<i> r = first.data; 
        first = first.next; 
        return r; 
    } 
} 

class Producer<p,s,i> entry EP { 
    int n = 0; 
    public void produce(Stack<s,i> s) { 
        s.push(new Int<i>(n++)); 
    } 
} 

class Consumer<c,s,i> entry EC { 
    Int<i> r; 
    public void consume(Stack<s,i> s) { 
        r = s.pop(); 
    } 
} 

class Driver<d> entry ED { 
    public void enter() { 
        Stack<PCS,PCI> s = new Stack<PCS,PCI>(); 
        Producer<P,PCS,PCI> p = 
            new Producer<P,PCS,PCI>(); 
        Consumer<C,PCS,PCI> c = 
            new Consumer<C,PCS,PCI>(); 
        p.produce(s); 
        c.consume(s); 
    } 
} 

class ProducerConsumer { 
    public static void main(String[] argv) { 
        new Driver<D>().enter(); 
    } 
} 


Figure 1: Example Producer/Consumer Program 


The program identifies some of the classes as subsystem 
entry points. In our example, the program uses the entry EP 
clause to identify all of the methods in the Producer class 
as entry points to the EP subsystem, and similarly for the 
EC and ED subsystems. 

Once the program enters a subsystem, it remains within 
that subsystem until it invokes a method in a class that is 
an entry point for a different subsystem. So in our example, 
execution starts within the MAIN subsystem, then moves into 
the ED subsystem when the main method invokes the enter 
method. The ED subsystem then invokes the EP and EC sub- 
systems to produce and consume the data. 

Note that because the push and pop methods are not sub- 
system entry points, invocations of these methods are part 
of the same subsystem that invoked them. This approach 
enables the construction of general-purpose classes that may 
be used for different purposes in different subsystems. 
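
The subsystem rule described above can be sketched in a few lines. This is an illustration only: SubsystemTracker and its methods are hypothetical names, and the sketch assumes that each class is an entry point for at most one subsystem.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the class and method names are hypothetical,
// not part of the implemented tool.
class SubsystemTracker {
    // Maps a class name to the subsystem it is declared an entry point for.
    private final Map<String, String> entryOf = new HashMap<>();

    void declareEntry(String className, String subsystem) {
        entryOf.put(className, subsystem);
    }

    // Subsystem in which a callee runs, given the caller's subsystem.
    // Methods of entry classes switch subsystem; all others inherit it.
    String subsystemAfterCall(String callerSubsystem, String calleeClass) {
        return entryOf.getOrDefault(calleeClass, callerSubsystem);
    }

    // Entry declarations matching Figure 1.
    static SubsystemTracker forFigure1() {
        SubsystemTracker t = new SubsystemTracker();
        t.declareEntry("Producer", "EP");
        t.declareEntry("Consumer", "EC");
        t.declareEntry("Driver", "ED");
        return t;
    }
}
```

With the Figure 1 declarations, a call from MAIN into Driver moves execution to ED, a call from ED into Producer moves it to EP, and a call from EP into Stack stays in EP because Stack is not an entry point.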


2.2 Polymorphic Token Types 


Each class has a set of token parameters. The first 
parameter identifies the token placed on instances of the 
class, while the other parameters are used to declare the 
types of the reference fields of instances of the class. In 
our example, the Stack<s,i> class has two parameters: the 
token variable s identifies the token placed on stack 
instances, while the token variable i identifies the token 
placed on items in the stack. The class can use these token 
variables to declare the types of its reference fields and 
the types of the parameters of its methods. 

The program specifies values for the token parameters at 
object creation sites. In our example, the enter method uses 
the statement Stack<PCS,PCI> s = new Stack<PCS,PCI>(); 
to create a new instance s of the Stack class with tokens PCS 
(producer/consumer stack) and PCI (producer/consumer item). 
This object creation site uses concrete token values (PCS and 
PCI). It is possible, however, for the program to use token 
variables to specify the tokens at object creation sites. 
Consider, for example, the object creation site new Int<i>(n++); 
inside the produce method. This site uses the token variable 
i to identify the token placed on the newly created Int 
object. 

As our example illustrates, token variables support a form 
of polymorphism in which different instantiations of the same 
class can have different tokens. This mechanism supports 
general classes whose instances serve different conceptual 
purposes in the computation. 


2.3 Analysis 


The goal of our analysis is to compute, at the granularity 
of tokens, the referencing relationships within the program. 
This information allows the analysis to characterize struc- 
tural relationships in the heap. It also serves as a foundation 
for computing behavioral information about how subsystems 
access and share information. 

Our analysis processes the object creation and method call 
statements to propagate token variable binding information 
from callers to callees. In effect, the analysis substitutes out 
all of the token variables from all of the types, replacing 
the variables with the concrete tokens that actually appear 
when the program runs. 

In our example, the analysis propagates token bindings 
from the enter method to the produce and consume methods 
as follows. At the call to the produce method, the analysis 
uses the declared types of p and s to generate the binding 
$[p \mapsto \mathtt{P},\ s \mapsto \mathtt{PCS},\ i \mapsto \mathtt{PCI}]$ for the token variables in 
the produce method. It then propagates these bindings to 
generate the binding $[s \mapsto \mathtt{PCS},\ i \mapsto \mathtt{PCI}]$ for the token 
variables in the push method. In a similar way, the analysis can 
substitute out the token variables in the consume and pop 
methods to obtain a complete set of bindings for all of the 
token variables in the program. 
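
The propagation step just described can be sketched as a substitution. The sketch below is not the implemented algorithm: TokenBindings, resolve, and bindCallee are hypothetical names, and it assumes that concrete token names and token variable names never collide, so that a name missing from the caller's binding can be treated as a concrete token.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of propagating token bindings from caller to callee.
class TokenBindings {
    // Resolves a token argument: a token variable bound in the caller's
    // binding is replaced by its token; a concrete token maps to itself.
    static String resolve(String arg, Map<String, String> callerBinding) {
        return callerBinding.getOrDefault(arg, arg);
    }

    // Binds the formal token parameters of the callee's class to the
    // resolved token arguments in the receiver's declared type.
    static Map<String, String> bindCallee(List<String> formals,
                                          List<String> actuals,
                                          Map<String, String> callerBinding) {
        if (formals.size() != actuals.size()) {
            throw new IllegalArgumentException("token arity mismatch");
        }
        Map<String, String> callee = new LinkedHashMap<>();
        for (int k = 0; k < formals.size(); k++) {
            callee.put(formals.get(k), resolve(actuals.get(k), callerBinding));
        }
        return callee;
    }
}
```

For the call p.produce(s) in enter, the formals of Producer are (p, s, i) and the arguments in the declared type of p are (P, PCS, PCI), which yields the binding for produce. Applying bindCallee again for s.push(...), with formals (s, i) and arguments (s, i) resolved under the produce binding, yields the binding for push.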

The token propagation algorithm also propagates the cur- 
rent subsystem identifier between invoked methods. The 
combined analysis result contains both the token variable 
bindings and a binding that indicates the subsystems that 
may execute each method. So, in our example, the analysis 
computes that the push method may execute as part of the 
EP subsystem, and that the pop method may execute as part 
of the EC subsystem. 

At this point, the analysis can use the bindings to com- 
pute, for each local variable, the set of tokens that represent 
the objects to which the variable may refer. As described 
below in Sections 4.4, 4.5, and 4.6, this information enables 
the analysis to produce models that characterize the objects 
that each subsystem may access and the ways that informa- 
tion may flow between subsystems. 

As described below in Section 4.3, the bindings at object 
creation sites, when combined with the type declarations for 
object fields, enable the analysis to produce an object model 
that characterizes the referencing relationships between ob- 
jects at the granularity of tokens. 

Finally, the question arises of how to combine binding 
information when different method invocations may have 
different token variable bindings. Our framework supports 
both context-sensitive approaches (which provide a separate 
result for each different combination of the values of 
the token variables and subsystems in each method) and 
context-insensitive approaches (which combine the different 
contexts to generate a single mapping of token variables to 
possible values valid for all executions). An intermediate 
approach combines contexts from the same subsystem but 
keeps contexts from different subsystems apart. 
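
The three options can be contrasted with a small sketch. It is an illustration under assumptions, not the framework's implementation: ContextMerge and Context are hypothetical names, and a context is modelled as a subsystem name plus one binding of token variables. The list of contexts itself is the context-sensitive result; the two methods compute the context-insensitive and the intermediate results.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of combining the contexts of one method.
class ContextMerge {
    // One analysis context: the subsystem executing the method and the
    // binding of the method's token variables in that context.
    static final class Context {
        final String subsystem;
        final Map<String, String> binding;

        Context(String subsystem, Map<String, String> binding) {
            this.subsystem = subsystem;
            this.binding = binding;
        }
    }

    // Adds every (variable, token) pair of a binding to a merged map.
    private static void addBinding(Map<String, Set<String>> into,
                                   Map<String, String> binding) {
        for (Map.Entry<String, String> e : binding.entrySet()) {
            into.computeIfAbsent(e.getKey(), key -> new TreeSet<>())
                .add(e.getValue());
        }
    }

    // Context-insensitive: each variable maps to all of its possible tokens.
    static Map<String, Set<String>> mergeAll(List<Context> contexts) {
        Map<String, Set<String>> merged = new TreeMap<>();
        for (Context c : contexts) {
            addBinding(merged, c.binding);
        }
        return merged;
    }

    // Intermediate: contexts are merged only within the same subsystem.
    static Map<String, Map<String, Set<String>>> mergePerSubsystem(
            List<Context> contexts) {
        Map<String, Map<String, Set<String>>> result = new TreeMap<>();
        for (Context c : contexts) {
            addBinding(result.computeIfAbsent(c.subsystem,
                    key -> new TreeMap<>()), c.binding);
        }
        return result;
    }
}
```

In Figure 1 the push method has a single context, so all three approaches agree. They differ only when a class such as Stack is also used elsewhere with other tokens.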


2.4 Object Models 


In our system, the concrete type of each object, in com- 
bination with the types of the objects that it (transitively) 
references, characterizes the structure of the heap reachable 
from the object. Once our analysis has computed the bind- 
ings for the token variables at each object allocation site, 
it can use the type declarations for the fields of the object 
to build an object model that characterizes the referencing 
relationships in the part of the heap reachable from that ob- 
ject. This object model is a labelled, directed graph. The 
nodes in the graph correspond to tokens; there is an edge 
between two tokens if one of the objects represented by the 
first token may contain a reference to an object represented 
by the second token. The label on the edge is the name of 
the field that may contain the reference. 
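
As a rough illustration of this construction, the sketch below expands one ground type into labelled token edges by substituting tokens into the declared field types. All names (ObjectModelSketch, Field, declare, edgesFrom) are hypothetical, edges are encoded as plain strings for brevity, and only reference fields are modelled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; none of these names come from the implemented tool.
class ObjectModelSketch {
    // A reference field: its name, its class, and the token arguments of
    // its declared type (token variables of the enclosing class).
    static final class Field {
        final String name;
        final String cls;
        final List<String> args;

        Field(String name, String cls, List<String> args) {
            this.name = name;
            this.cls = cls;
            this.args = args;
        }
    }

    private final Map<String, List<String>> params = new HashMap<>();
    private final Map<String, List<Field>> fields = new HashMap<>();

    void declare(String cls, List<String> tokenParams, List<Field> refFields) {
        params.put(cls, tokenParams);
        fields.put(cls, refFields);
    }

    // Edges "t1 -field-> t2" for the heap reachable from one ground type.
    Set<String> edgesFrom(String cls, List<String> tokens) {
        Set<String> edges = new TreeSet<>();
        collect(cls, tokens, new HashSet<>(), edges);
        return edges;
    }

    private void collect(String cls, List<String> tokens,
                         Set<String> seen, Set<String> edges) {
        if (!params.containsKey(cls) || !seen.add(cls + tokens)) {
            return; // undeclared class, or ground type already expanded
        }
        Map<String, String> binding = new HashMap<>();
        List<String> formals = params.get(cls);
        for (int k = 0; k < formals.size(); k++) {
            binding.put(formals.get(k), tokens.get(k));
        }
        for (Field f : fields.get(cls)) {
            List<String> ground = new ArrayList<>();
            for (String a : f.args) {
                ground.add(binding.getOrDefault(a, a));
            }
            // The first token argument is the token on the referenced object.
            edges.add(tokens.get(0) + " -" + f.name + "-> " + ground.get(0));
            collect(f.cls, ground, seen, edges);
        }
    }

    // The reference fields declared in Figure 1.
    static ObjectModelSketch figure1() {
        ObjectModelSketch m = new ObjectModelSketch();
        m.declare("Int", List.of("i"), List.of());
        m.declare("Node", List.of("s", "i"), List.of(
            new Field("next", "Node", List.of("s", "i")),
            new Field("data", "Int", List.of("i"))));
        m.declare("Stack", List.of("s", "i"), List.of(
            new Field("first", "Node", List.of("s", "i"))));
        m.declare("Consumer", List.of("c", "s", "i"), List.of(
            new Field("r", "Int", List.of("i"))));
        return m;
    }
}
```

Expanding the creation site Stack<PCS,PCI> in this sketch gives the edges first and next from PCS to itself and data from PCS to PCI; expanding Consumer<C,PCS,PCI> adds the edge r from C to PCI. The set of already expanded ground types keeps the recursion through Node.next from looping.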

By combining the object models from each of the object 
creation sites, the analysis can produce a single object model 
that characterizes, at the granularity of tokens, all of the ref- 
erencing relationships in the entire heap. In some cases it is 
also desirable to summarize local variable referencing rela- 
tionships in the object model. Our tool can therefore process 
the local variable declarations to insert an unlabelled edge 
between two tokens if a method of an object represented by 
the first token has a local variable that may refer to an 
object represented by the second token. Figure 2 presents the 
object model from our example; this object model contains 
the unlabelled edges from local variables.² 


Figure 2: Object Model for Producer/Consumer 


Figure 3: Subsystem Usage Model for Example 


2.5 Subsystem Usage Models 


Our analysis processes the statements in each method in 
the context of the token variable binding information to ex- 
tract a subsystem usage model. This model characterizes 
how subsystems access objects at the granularity of tokens. 
Each subsystem usage model is a bipartite graph. The nodes 
in the graph correspond to subsystems and tokens; there is 
an edge connecting a subsystem and a token if the subsystem 
may access objects represented by the token. 

Figure 3 presents the subsystem usage model from our 
example program. The square nodes represent subsystems; 
the ellipse nodes represent tokens. The edge between EP and 
PCS, for example, indicates that EP may access the stack used 
to pass values between the producer and consumer. 


2.6 Call/Return Interaction Models 


Call/return interaction models characterize the control 
and data flow transfers that take place when a method in 
one subsystem invokes a method in a different subsystem. 


²We have implemented our type system, analysis, and model 
extraction algorithms. To ease the construction of the 
parser, it accepts a language whose surface syntactic details 
differ a bit from those in our example. For example, our 
implemented system encloses token parameters in *< and 
*> instead of < and >. We use the dot graph presentation 
system [12] to automatically produce graphical representa- 
tions of our extracted models. All of the pictures in this 
paper were automatically produced using our implemented 
system. 


Figure 4: Call/Return Interaction Model for Example 


The model itself is a labelled, directed graph. The nodes 
correspond to subsystems; there is a solid edge between two 
subsystems if a method in the first subsystem may invoke a 
method in the second subsystem. There is a dashed edge if 
the second method may return an object to the first subsys- 
tem. The labels on the edges are the tokens that represent 
the objects passed as parameters or returned as values. 

We use the analysis results to extract the call/return inter- 
action model as follows. At each method call site, we retrieve 
the bindings that the analysis has computed for each of the 
token variables in the types of the parameters. These bind- 
ings identify the tokens that represent the objects passed as 
parameters from the caller to the callee. We also extract the 
subsystems for the caller and the callee. 

If the callee is an entry method, the analysis generates a 
solid edge between the two subsystems and labels the edge 
with the set of tokens that represent the parameters. If the 
invoked method returns an object, it also generates a return 
edge, using the analysis results at the return statement(s) 
in the callee to extract the tokens on the label of the return 
edge. Figure 4 presents the call/return interaction model in 
our example. 
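
The extraction steps above can be sketched as follows. This is a simplified illustration, not the implemented extractor: CallReturnModel and recordCall are hypothetical names, the token bindings are assumed to have been resolved already, constructor calls are ignored, and the edges produced from the Figure 1 call sites are what this sketch yields rather than a reproduction of Figure 4.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of building the call/return interaction model.
class CallReturnModel {
    // Key "s1 -> s2" maps to the tokens that label that edge.
    final Map<String, Set<String>> callEdges = new TreeMap<>();
    final Map<String, Set<String>> returnEdges = new TreeMap<>();

    // Records one call site. calleeEntry is null when the callee's class
    // is not a subsystem entry point; such a call stays inside the
    // caller's subsystem and adds no edge. returnToken is null when the
    // callee returns no object.
    void recordCall(String caller, String calleeEntry,
                    List<String> paramTokens, String returnToken) {
        if (calleeEntry == null) {
            return;
        }
        callEdges.computeIfAbsent(caller + " -> " + calleeEntry,
                key -> new TreeSet<>()).addAll(paramTokens);
        if (returnToken != null) {
            returnEdges.computeIfAbsent(calleeEntry + " -> " + caller,
                    key -> new TreeSet<>()).add(returnToken);
        }
    }

    // The method call sites of Figure 1, with tokens already resolved.
    static CallReturnModel figure1() {
        CallReturnModel m = new CallReturnModel();
        m.recordCall("MAIN", "ED", List.of(), null);     // enter()
        m.recordCall("ED", "EP", List.of("PCS"), null);  // p.produce(s)
        m.recordCall("ED", "EC", List.of("PCS"), null);  // c.consume(s)
        m.recordCall("EP", null, List.of("PCI"), null);  // s.push(...)
        m.recordCall("EC", null, List.of(), "PCI");      // s.pop()
        return m;
    }
}
```

In this sketch the calls to push and pop add no edges because Stack is not an entry point, and there are no return edges because none of the entry methods in Figure 1 returns an object.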


2.7 Heap Interaction Models 


Heap interaction models capture the indirect interactions 
that take place via objects in the heap. The nodes in this 
model correspond to tokens. There is a solid edge between 
two tokens if a subsystem may write a reference to an object 
represented by the first token into an object represented 
by the second token; there is a dashed edge (in the opposite 
direction) if a subsystem may read that reference. The label 
on each edge is the subsystem that performed the write or 
the read. We remove a node (and its incident edges) from 
the final graph if all of its incident edges have the same 
subsystem label. 

We use the analysis results to compute the heap inter- 
action model as follows. At each statement that writes a 
reference from one object to another object, we retrieve the 
subsystems that may execute the statement, and, for each 
subsystem, the tokens that represent the two objects. There 
is an edge between each possible pair of tokens that repre- 
sent the source and target objects. The label on each such 
edge is the corresponding retrieved subsystem. 

Figure 5 presents the heap interaction model for our example. 
The solid lines indicate that the EP subsystem may 
write a PCI object into a PCS object and that the EC subsystem 
may write a PCI object into a C object. The dashed line 
indicates that the EC subsystem may read the PCI object 
back out of the PCS object. 


Figure 5: Heap Interaction Model for Example 

Note that we have placed the PCS token on both the Stack 
object and the Node objects that implement the Stack’s in- 
ternal state, in effect collapsing all of the objects to a single 
abstraction in the heap interaction model (and other mod- 
els as well). This is an example of how tokens allow the 
developer to hide irrelevant detail in the generated models. 
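
The edge construction for the heap interaction model can be sketched as below. The names HeapInteractionModel, recordWrite, and recordRead are hypothetical, edges are encoded as strings ("==" for solid write edges, "--" for dashed read edges), and the sketch covers only edge generation; it does not implement the node removal step. The figure5 method records just the three interactions that the text reports for Figure 5.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of heap interaction edge generation.
class HeapInteractionModel {
    final Set<String> edges = new TreeSet<>();

    // A statement such as target.f = value, executed by one subsystem.
    // Adds a solid edge from each value token to each target token.
    void recordWrite(String subsystem, Set<String> valueTokens,
                     Set<String> targetTokens) {
        for (String v : valueTokens) {
            for (String t : targetTokens) {
                edges.add(v + " ==" + subsystem + "==> " + t);
            }
        }
    }

    // A statement such as result = source.f, executed by one subsystem.
    // Adds a dashed edge from each source token to each result token.
    void recordRead(String subsystem, Set<String> sourceTokens,
                    Set<String> resultTokens) {
        for (String s : sourceTokens) {
            for (String r : resultTokens) {
                edges.add(s + " --" + subsystem + "--> " + r);
            }
        }
    }

    // The three interactions described for Figure 5.
    static HeapInteractionModel figure5() {
        HeapInteractionModel m = new HeapInteractionModel();
        m.recordWrite("EP", Set.of("PCI"), Set.of("PCS")); // n.data = k
        m.recordWrite("EC", Set.of("PCI"), Set.of("C"));   // r = s.pop()
        m.recordRead("EC", Set.of("PCS"), Set.of("PCI"));  // first.data
        return m;
    }
}
```

Because the Stack and its Node objects share the PCS token, the write to n.data appears as a write into PCS, which is how the internal list structure disappears from the model.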


2.8 Discussion 


As this example illustrates, extracting and using pointer 
analysis information is relatively straightforward given the 
polymorphic token declarations. This information allows us 
to create a broad range of models that characterize the heap 
structure of the program, its information access behavior, 
and both the direct and the indirect information flow be- 
tween its subsystems. 

We note that our analysis has more information about the 
program than it presents in the extracted models. We have 
chosen our specific set of models based on our expectations 
of what developers would find most useful. We envision, 
however, a much richer interactive program exploration sys- 
tem that would allow developers to customize the models 
to include more or less detail depending on their current 
needs. To cite just one example, the developer could choose 
to display the name and method of each local variable that 
generated a given unlabelled edge in the object model. Such 
a system would give developers appropriate access to all of 
the information that the analysis extracts. 


3. TYPE SYSTEM 


We next present a formal treatment of our type system. The 
type system is used to check token consistency constraints: 
we check that declared tokens match actual tokens, and that 
the aliasing and access restrictions of tokens are satisfied by 
the actions of the code. The type system ensures that our 
models are sound; for instance, when a token is changed, 
the type system has ensured that the object had only one 
heap reference to it, so that the token change is valid. Our 
static type system is realized as a set of typing rules for a 
simplified core language; in particular, note that we omit 
subsystems from the type checking rules we present here. 
Subsystems propagate identically to tokens, and induce no 
constraints on the runtime behaviour of the program. 


3.1 Static Type System 


Figure 6 presents the static type rules that define the type 
checker. The type system enforces the following token con- 
straints: 


• The program never creates multiple references to a 
unique or borrowed object. 


• The program never writes a borrowed object to a field. 


• The program never writes to a field of a read-only 
object. 


• The program never accesses global tokens from methods 
labelled action. 


Formally, a program consists of a sequence of class 
definitions, containing method, field and token definitions (see 
Rule [PROG] in Figure 6). The goal is to derive the type 
judgement $\vdash P$, indicating that the program satisfies the 
static type constraints. 

The type system checks each method in turn by using its 
owning class's token definitions and where clauses, in 
conjunction with the method parameter definitions, to construct 
an initial typing environment for the method (see 
Rule [PROC]). The type system then checks each statement 
of the method in turn (Rules [STMT ACQUIRE] through 
[STMT DESTR WRITE]). For each statement, it attempts 
to derive a typing judgement of the form $P; E \vdash s$, which 
indicates that the statement type-checks in the context of 
the program $P$ and the typing environment $E$. The typing 
environment $E$ binds variables to types and provides the 
list of formal token variables along with constraints on their 
possible kinds. 

We next discuss how the type system enforces the ba- 
sic consistency requirements for the different statements. 
Consider the Rules [EXP FIELD READ] and [EXP VAR 
READ]. The token associated with, respectively, x.fd or x, 
must not have unique or borrowed kind; these objects may 
only be read in the context of a destructive access statement. 

The Rule [STMT INVOKE] ensures that a method call 
may only occur when the necessary conditions hold. We 
verify that all tokens used as actual token parameters exist 
($P; E \vdash_{token} \alpha_i$). We extract the kind $k_j$ corresponding 
to each actual method parameter $e_j$; if it is $U$, then either 
$e_j$ must be destructively read (that is, $e_j$ has the form $e\texttt{--}$) 
or we must add a borrowing token constraint to our augmented 
typing environment $E'$, ensuring that token $\alpha_j$ can use kind $B$ in 
the callee's context. Finally, the callee's required token 
constraints $c_i$ must hold in the augmented typing environment 
($P; E' \vdash_{tconstr} c_i$). 

Next consider the type rule for destructive field writes 
x.fd = y-- (Rule [STMT DESTR WRITE]). Safety conditions 
are ensured by preventing writes to read-only objects 
($k_x \neq R$) and writes of borrowed values ($k_y \neq B$). 
As in Rule [EXP FIELD READ], we check that permission 
to write to $t_x$ is available ($k_x = B \lor k_y = U$). For a 
non-destructive field write, we add the additional restriction that 
y's token may be neither borrowed nor unique: $k_y \notin \{B, U\}$. 
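The kind restrictions on field writes can be illustrated with a small checker. This is only a sketch under stated assumptions, not the implemented type checker: the `Kind` enumeration, its `OTHER` member (a stand-in for any kind that is not unique, borrowed, or read-only) and the function `check_field_write` are hypothetical names, and the write-permission premise of the rule is deliberately left out.

```python
from enum import Enum


class Kind(Enum):
    U = "unique"
    B = "borrowed"
    R = "read-only"
    OTHER = "other"  # hypothetical stand-in for every remaining kind


def check_field_write(k_x: Kind, k_y: Kind, destructive: bool) -> bool:
    """Check the kind restrictions for x.fd = y (or x.fd = y-- if destructive).

    k_x is the kind of the token on x, k_y the kind of the token on y.
    """
    if k_x is Kind.R:
        return False  # never write to a field of a read-only object
    if k_y is Kind.B:
        return False  # never write a borrowed object to a field
    if not destructive and k_y is Kind.U:
        return False  # a unique value may only be written with y--
    return True
```

Under these assumptions a unique value passes the check only when the write is destructive, which matches the restriction that a non-destructive write requires a kind outside {B, U}.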

Finally, consider the type Rules [STMT DESTR COPY] 
for destructive copy x = y-- and [STMT DESTR READ] 
for destructive read x = y.fd--. Destructive copy requires 
the source object in y to either be unique ($k_y = U$), in which 
case we require token compatibility: token $t_x$ must equal $t_y$ 
if $k_x \neq U$; or, for non-unique source objects ($k_y \neq U$), we 
require the tokens to match: $t_x = t_y$. A similar condition 
holds for destructive read, except that the token and kind of 
the field y.fd take the place of $t_y$ and $k_y$. 


4. ANALYSIS AND MODEL EXTRACTION 


We next present the analysis and model extraction algo- 
rithms. The purpose of the analysis is to determine all of 
the possible token variable bindings for each method. The 
model extraction algorithms use the bindings to produce the 
models. 


4.1 Preliminaries and Notation 


The program defines a set of tokens $t \in T$, token variables 
$p, v \in V \cup T$, a set of methods $m \in M$, a set of subsystem 
identifiers $s \in S$, a set of classes $k \in K$, a set of call sites 
$c \in C$, and a set of object creation sites $o \in O$. Each class $k$ 
has a set of fields $f \in \mathit{fields}(k)$. Each call site $c$ may invoke 
a set of methods $m \in \mathit{callees}(c)$; we compute the call graph 
information using a variant of class hierarchy analysis. Each call 
site $c$ is contained in a method $\mathit{method}(c)$ and each object 
creation site $o$ is contained in a method $\mathit{method}(o)$. If a method 
$m$ is an entry method, then $\mathit{entry}(m)$ is its subsystem identifier 
$s$, otherwise $\mathit{entry}(m) = \mathit{same}$, where $\mathit{same}$ is a special 
identifier indicating that each invocation of $m$ is part of the 
same subsystem as its caller. The type of an object created 
at an object creation site $o$ is $k(v_1, \ldots, v_l) = \mathit{type}(o)$, where 
$k$ is the class of the new object and $v_1, \ldots, v_l$ are the actual 
token parameters of the new object. Each local variable 
$lv \in LV$ has a type $k(v_1, \ldots, v_l) = \mathit{type}(lv)$. Each class $k$ has 
a set of formal token parameters $(p_1, \ldots, p_l) = \mathit{parms}(k)$ and 
a set of object references $f\ k'(v_1, \ldots, v_j)$, where $k'(v_1, \ldots, v_j)$ 
is the type of the object which field $f$ references. 

The analysis produces bindings $b \in B = T \cup V \to T$; we 
require that $b(t) = t$ for all $t \in T$. The identity function on 
tokens is $\mathit{Id} = \lambda t.\, t$. 
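A binding can be pictured as a finite map on token variables that falls back to the identity on tokens. The following is a minimal sketch, assuming that tokens and token variables are represented as strings; the class name `Binding` and the example names are hypothetical.

```python
class Binding:
    """A binding b from tokens and token variables to tokens.

    Tokens always map to themselves, so only token variables are stored.
    """

    def __init__(self, tokens, mapping=None):
        self.tokens = frozenset(tokens)
        self.mapping = dict(mapping or {})
        # Token variables must map to tokens; tokens must not be remapped.
        for var, tok in self.mapping.items():
            if var in self.tokens or tok not in self.tokens:
                raise ValueError(f"invalid binding {var!r} -> {tok!r}")

    def __call__(self, name):
        if name in self.tokens:
            return name  # b(t) = t for every token t
        return self.mapping[name]
```

For example, `Binding({"Gen", "Eng"}, {"p": "Gen"})` maps the token variable `p` to the token `Gen` and leaves `Eng` unchanged.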


4.2 Analysis 


The analysis propagates binding information from caller 
to callee to compute a set of calling contexts for each method. 
More specifically, for each method $m$, it produces a set of 
tuples $(s, b) \in \mathit{contexts}(m)$. This set of tuples satisfies the 
following context soundness condition:³ 


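If 


• $c$ is a call site with $l + 1$ actual parameters⁴ whose types 
are $k_0(v^0_1, \ldots, v^0_{n_0}), \ldots, k_l(v^l_1, \ldots, v^l_{n_l})$, 


• $c$ is inside a method $m_c = \mathit{method}(c)$, 


• $(s, b) \in \mathit{contexts}(m_c)$, $m \in \mathit{callees}(c)$, and 


• $m$ has $l + 1$ formal parameters whose types are 
$k_0(p^0_1, \ldots, p^0_{n_0}), \ldots, k_l(p^l_1, \ldots, p^l_{n_l})$, 


then $(s', [p^j_i \mapsto b(v^j_i) \,.\, 0 \le j \le l, 1 \le i \le n_j] \cup \mathit{Id}) \in \mathit{contexts}(m)$, 
where $s' = s$ if $\mathit{entry}(m) = \mathit{same}$, otherwise $s' = \mathit{entry}(m)$. 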


³ Note that constructors are treated just like any other 
method in this analysis. 


⁴ By convention, the receiver is parameter 0. 


The analysis produces an analysis result that satisfies this 
condition by propagating token bindings in a top-down fashion 
from callers to callees starting with the main method. It 
initializes the analysis by setting $\mathit{contexts}(\mathit{main}) = \{(\mathit{MAIN}, \mathit{Id})\}$. 
It uses a fixed-point computation within strongly connected 
components of the call graph to ensure that the final result 
satisfies the context soundness condition. Note that this 
algorithm produces a completely context-sensitive solution in 
that it records each context separately in the analysis result. 
It is also possible to adjust the algorithm to merge contexts 
and produce a less context-sensitive result. 
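The top-down propagation can be sketched as follows. This is an illustration under stated assumptions rather than the implemented analysis: it uses a plain worklist instead of a fixed point ordered by strongly connected components, it represents a binding as a frozenset of (token variable, token) pairs with the identity on tokens left implicit, and the function name `compute_contexts` and the encoding of call sites are hypothetical.

```python
from collections import deque

SAME = "same"  # marker for methods that are not entry methods


def compute_contexts(call_sites, entry, tokens,
                     main="main", main_subsystem="MAIN"):
    """Propagate (subsystem, binding) contexts from callers to callees.

    call_sites: list of (caller, callee, pairs); pairs lists, over all
                parameters of the call, (formal token variable of the
                callee, actual token or token variable in the caller).
    entry:      dict mapping each entry method to its subsystem.
    tokens:     set of all tokens.
    """
    def lookup(binding, name):
        # Tokens map to themselves; variables are resolved by the binding.
        return name if name in tokens else dict(binding)[name]

    contexts = {main: {(main_subsystem, frozenset())}}
    worklist = deque([main])
    while worklist:
        caller = worklist.popleft()
        for site_caller, callee, pairs in call_sites:
            if site_caller != caller:
                continue
            # Iterate over a snapshot: a recursive call may add contexts.
            for s, b in list(contexts[caller]):
                s_new = entry.get(callee, SAME)
                if s_new == SAME:
                    s_new = s  # callee stays in the caller's subsystem
                b_new = frozenset((p, lookup(b, v)) for p, v in pairs)
                seen = contexts.setdefault(callee, set())
                if (s_new, b_new) not in seen:
                    seen.add((s_new, b_new))
                    worklist.append(callee)
    return contexts
```

The loop terminates because there are finitely many subsystems, token variables and tokens, so each method can receive only finitely many distinct contexts. Merging contexts, as mentioned above, would amount to replacing the per-method set with a coarser summary.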


4.3 Object Model Extraction 


Figure 7 presents the object model extraction algorithm. 
This algorithm produces a set of nodes $N \subseteq T$ and a set of 
labelled edges of the form $(t_1, f, t_2)$; each such edge indicates 
that the field $f$ in an object represented by token $t_1$ may 
contain a reference to an object represented by token $t_2$. 
The algorithm processes all of the object creation sites $o$ in 
the program; for each site, it uses the token variable bindings 
produced by the analysis to determine the potential token 
instantiations for objects created at that site. It then uses 
the bindings to trace out the part of the heap reachable from 
objects created at that site. The visit algorithm uses a set 
$V$ of visited class/binding pairs to ensure that it terminates 
in the presence of recursive data structures. 


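set $N = \emptyset$, $E = \emptyset$, $V = \emptyset$ 
for all object creation sites $o \in O$ 
    let $m = \mathit{method}(o)$ 
    let $k(v_1, \ldots, v_l) = \mathit{type}(o)$ 
    let $(p_1, \ldots, p_l) = \mathit{parms}(k)$ 
    for all $(s, b) \in \mathit{contexts}(m)$ 
        visit($k$, $[p_i \mapsto b(v_i) \,.\, 1 \le i \le l] \cup \mathit{Id}$) 


visit($k$, $b$) 
    if $(k, b) \notin V$ then 
        let $(v_1, \ldots, v_l) = \mathit{parms}(k)$ 
        set $N = N \cup \{b(v_1)\}$ 
        set $V = V \cup \{(k, b)\}$ 
        for all $f\ k'(v'_1, \ldots, v'_j) \in \mathit{refs}(k)$ 
            set $E = E \cup \{(b(v_1), f, b(v'_1))\}$ 
            let $(p'_1, \ldots, p'_j) = \mathit{parms}(k')$ 
            visit($k'$, $[p'_i \mapsto b(v'_i) \,.\, 1 \le i \le j] \cup \mathit{Id}$) 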


Figure 7: Object Model Extraction Algorithm 
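The algorithm of Figure 7 can be transcribed into a short runnable sketch. The encoding is an assumption made for illustration: bindings are frozensets of (token variable, token) pairs, every class is assumed to have at least one token parameter, and the names `extract_object_model`, `parms`, `refs` and the example classes are hypothetical.

```python
def extract_object_model(creation_sites, contexts, parms, refs, tokens):
    """Return (nodes, edges) of the object model.

    creation_sites: list of (method, class, actual token parameters).
    contexts:       method -> set of (subsystem, binding).
    parms:          class -> tuple of formal token parameters.
    refs:           class -> list of (field, target class, actual token
                    parameters of the field type).
    """
    nodes, edges, visited = set(), set(), set()

    def apply(binding, name):
        # Tokens map to themselves; variables are resolved by the binding.
        return name if name in tokens else dict(binding)[name]

    def instantiate(binding, formals, actuals):
        return frozenset((p, apply(binding, v))
                         for p, v in zip(formals, actuals))

    def visit(k, b):
        if (k, b) in visited:
            return  # guarantees termination on recursive structures
        visited.add((k, b))
        owner = apply(b, parms[k][0])  # first parameter represents the object
        nodes.add(owner)
        for f, k2, actuals in refs.get(k, []):
            edges.add((owner, f, apply(b, actuals[0])))
            visit(k2, instantiate(b, parms[k2], actuals))

    for m, k, actuals in creation_sites:
        for _s, b in contexts.get(m, ()):
            visit(k, instantiate(b, parms[k], actuals))
    return nodes, edges
```

When a list and its nodes carry the same token, the whole recursive structure collapses into one node with self edges, which is how placing one token on an internal data structure hides its details in the extracted model.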


Note that this algorithm produces only the labelled edges 
for the heap references. Our implemented algorithm also 
processes the local variable declarations to add the unla- 
belled edges that summarize potential referencing relation- 
ships associated with the local variables in each class. 


4.4 Subsystem Access Model Extraction 


Figure 8 presents the subsystem access model extraction 
algorithm. It produces a set of nodes $N \subseteq S \cup T$ and a set of 
edges $E$ of the form $(s, t)$; each such edge indicates that the 
subsystem $s$ may access an object represented by token $t$. 
The algorithm processes all of the accesses in the program, 
retrieving the binding information produced by the analysis 
to determine 1) the subsystems that can execute the access 
and 2) the tokens that represent the accessed objects. 
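The subsystem access extraction can be sketched in a few lines. As before, this is an illustration under assumptions: each access is encoded as a pair of the enclosing method and the first token parameter of the accessed variable's type, bindings are frozensets of (token variable, token) pairs, and the function name `extract_subsystem_access` is hypothetical.

```python
def extract_subsystem_access(accesses, contexts, tokens):
    """Return (nodes, edges) of the subsystem access model.

    accesses: list of (method, v1), where v1 is the first token parameter
              of the type of the local variable being accessed.
    contexts: method -> set of (subsystem, binding).
    """
    nodes, edges = set(), set()
    for m, v1 in accesses:
        for s, b in contexts.get(m, ()):
            # Resolve the token that represents the accessed object.
            t = v1 if v1 in tokens else dict(b)[v1]
            nodes.update({s, t})
            edges.add((s, t))
    return nodes, edges
```

A method that is analyzed in two contexts contributes one edge per context, so the same access statement may produce edges from different subsystems to different tokens.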


Figure 6: Type Rules 


set $N = \emptyset$, $E = \emptyset$ 
for each method $m$ 
    for each access $lv.f$ in $m$ 
        let $k(v_1, \ldots, v_l) = \mathit{type}(lv)$ 
        for each $(s, b) \in \mathit{contexts}(m)$ 
            set $N = N \cup \{s, b(v_1)\}$ 
            set $E = E \cup \{(s, b(v_1))\}$ 


Figure 8: Subsystem Access Model Extraction Algorithm 


4.5 Call/Return Interaction Model Extraction 


Figure 9 presents the call/return model extraction algorithm. 
It produces a set of nodes $N \subseteq S$ and a set of edges 
$E$ of the form $(s_1, t, s_2)$. The algorithm processes all of the 
call sites in the program, retrieving the binding information 
produced by the analysis to determine 1) if the call site may 
invoke an entry method of a different subsystem, and 2) if 
so, the tokens that represent the objects passed as parameters 
between the subsystems. Note that there is an edge 
for each such token. To eliminate visual clutter, our model 
display algorithm coalesces all edges between the same two 
subsystems, producing a single edge with a list of the tokens 
passed as parameters between the subsystems. 


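set $N = \emptyset$, $E = \emptyset$ 
for each call site $c$ 
    for each $(s, b) \in \mathit{contexts}(\mathit{method}(c))$ 
        for each $m \in \mathit{callees}(c)$ 
            let $s' = \mathit{entry}(m)$ 
            if $s' \neq \mathit{same}$ and $s' \neq s$ then 
                set $N = N \cup \{s, s'\}$ 
                let $k_0(v^0_1, \ldots, v^0_{n_0}), \ldots, k_l(v^l_1, \ldots, v^l_{n_l})$ be the types 
                    of the actual parameters at the call site $c$ 
                set $E = E \cup \{(s, b(v^i_1), s') \,.\, 1 \le i \le l\}$ 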


Figure 9: Call/Return Model Extraction Algorithm 


The algorithm in Figure 9 does not generate the return 
edges. Our implemented algorithm generates these edges by 
similarly processing the return statements of entry methods. 
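The call edge extraction and the coalescing step used for display can be sketched together. The encoding is assumed for illustration only: each call site lists the first token parameter of the type of every non-receiver actual parameter, bindings are frozensets of (token variable, token) pairs, return edges are not handled, and the names `extract_call_edges` and `coalesce` are hypothetical.

```python
from collections import defaultdict

SAME = "same"  # marker for methods that are not entry methods


def extract_call_edges(call_sites, contexts, entry, tokens):
    """Return (nodes, edges) with one edge (s, t, s2) per passed token.

    call_sites: list of (caller, callee, arg_vars).
    contexts:   method -> set of (subsystem, binding).
    entry:      dict mapping each entry method to its subsystem.
    """
    nodes, edges = set(), set()
    for caller, callee, arg_vars in call_sites:
        s2 = entry.get(callee, SAME)
        if s2 == SAME:
            continue  # the callee runs in the caller's subsystem
        for s, b in contexts.get(caller, ()):
            if s2 == s:
                continue  # call stays inside one subsystem
            nodes.update({s, s2})
            for v in arg_vars:
                t = v if v in tokens else dict(b)[v]
                edges.add((s, t, s2))
    return nodes, edges


def coalesce(edges):
    """Merge parallel edges into one labelled edge per subsystem pair."""
    merged = defaultdict(set)
    for s, t, s2 in edges:
        merged[(s, s2)].add(t)
    return {pair: sorted(ts) for pair, ts in merged.items()}
```

Coalescing keeps the information content of the model unchanged; it only groups the tokens passed between the same two subsystems onto a single edge label.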


4.6 Heap Interaction Model Extraction 


The heap interaction model extraction algorithm produces 
a set of nodes $N \subseteq T$ and two sets of edges. The write 
edges $W \subseteq T \times S \times T$ summarize the write interactions; 
an edge $(t_1, s, t_2) \in W$ indicates that the subsystem $s$ may 
write a reference to an object represented by token $t_1$ into an 
object represented by token $t_2$. The read edges $R \subseteq T \times S \times T$ 
summarize the read interactions; an edge $(t_1, s, t_2) \in R$ 
indicates that the subsystem $s$ may read a reference to an 
object represented by token $t_2$ from an object represented 
by token $t_1$. 

Figure 10 presents the algorithm that extracts the write 
interactions $W$. The algorithm processes all of the write 
accesses in the program, retrieving the binding information 
produced by the analysis to determine 1) the subsystems 
that may perform the write and 2) the tokens that represent 
the accessed objects. The algorithm that extracts the read 
interactions is similar. The set of nodes $N$ is initialized to $\emptyset$ 
before the read and write interaction algorithms execute. To 
reduce visual clutter, the model display algorithm removes 
all nodes whose incident edges all have the same label. 


set $W = \emptyset$ 
for each method $m$ 
    for each write access $lv_1.f = lv_2$ in $m$ 
        let $k_1(v^1_1, \ldots, v^1_{l_1}) = \mathit{type}(lv_1)$ 
        let $k_2(v^2_1, \ldots, v^2_{l_2}) = \mathit{type}(lv_2)$ 
        for each $(s, b) \in \mathit{contexts}(m)$ 
            if $b(v^1_1) \neq b(v^2_1)$ then 
                set $N = N \cup \{b(v^1_1), b(v^2_1)\}$ 
                set $W = W \cup \{(b(v^2_1), s, b(v^1_1))\}$ 


Figure 10: Heap Interaction Model Extraction Algorithm 
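The write interaction extraction can be sketched in the same style. The encoding is an assumption for illustration: each write lv1.f = lv2 is given as the enclosing method plus the first token parameters of the types of lv1 and lv2, bindings are frozensets of (token variable, token) pairs, and the function name `extract_write_edges` is hypothetical. The display step that removes nodes is not modelled.

```python
def extract_write_edges(writes, contexts, tokens):
    """Return (nodes, write_edges) for the heap interaction model.

    writes:   list of (method, v_target, v_source) for lv1.f = lv2, where
              v_target and v_source are the first token parameters of the
              types of lv1 and lv2 respectively.
    contexts: method -> set of (subsystem, binding).
    """
    def resolve(binding, name):
        return name if name in tokens else dict(binding)[name]

    nodes, write_edges = set(), set()
    for m, v_target, v_source in writes:
        for s, b in contexts.get(m, ()):
            t_target = resolve(b, v_target)
            t_source = resolve(b, v_source)
            if t_target != t_source:  # skip writes within one token
                nodes.update({t_target, t_source})
                # (t1, s, t2): s writes a reference to t1 into t2
                write_edges.add((t_source, s, t_target))
    return nodes, write_edges
```

Writes whose source and target carry the same token are dropped, so updates internal to a data structure that shares one token do not appear as interactions.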


5. EXPERIENCE 


We have implemented a prototype version of our system 
by extending the Kopi Java compiler⁵. We tested our 
approach on Tagger, a text formatting system written by 
Daniel Jackson. Tagger consists of 1721 lines of Java code 
and 14 classes (not counting the standard Java libraries). It 
accepts a text file augmented with formatting commands as 
input and produces as output another text file in the Quark 
document definition language. 

We first augmented Tagger with subsystem and token an- 
notations. This augmentation increased the number of lines 
of code to 1755. We added token and/or subsystem annota- 
tions to a total of 201 lines of code. This augmented version 
has the following subsystems, with one boundary class per 
subsystem: 


• Pars: The parser subsystem, which contains code to 
read the input file, group characters into words, and 
recognize formatting commands. 


• Pmap: The property management subsystem, which 
manages the data structures that control the translation 
between each Tagger formatting command and 
the corresponding Quark output. 


• Act: The action subsystem, which uses the property 
management subsystem to translate Tagger commands into 
Quark commands, then passes the output to the Gen 
subsystem. 


• Gen: The generation subsystem, which produces the 
output Quark document. This subsystem manages the 
translation of the Quark commands into a flat stream 
of output symbols. It is responsible for generating the 
surface syntax of the Quark document and producing 
the output file. 


• Eng: The engine subsystem, which processes the Tagger 
commands and serially dispatches each command 
to the Act subsystem. 


• Main: The main subsystem, which initializes the system 
and implements the connection between the Pars 
subsystem, which reads the input file, and the Act 
subsystem, which processes the text and Tagger commands 
in the file. 


⁵ Available at http://www.dms.at/kopi/ 


Of the original 14 classes, six are boundary classes in the 
annotated version. Two more are abstract superclasses of 
boundary classes. Another two are used to transfer data be- 
tween the Pars, Eng, Act, and Gen subsystems; their meth- 
ods simply store and retrieve the transferred data. Another 
class reads in the configuration data that governs the trans- 
lation from Tagger to Quark formatting commands; this 
class is encapsulated within the PMap subsystem. Another 
two store updatable processing state relating to the output 
document, for example the current position in an itemized 
list of paragraphs. These classes are encapsulated inside the 
Eng subsystem. The remaining class manages assertions. 

The augmented version has the following tokens: 


• Gen: 


• Eng: The token for instances 


To facilitate the use of code from the Java libraries, our 
implemented system generates a single token for each class in 
the library and by default places that token on each instance 
of the corresponding class. 

In general, Tagger separates classes with state from classes 
that only encapsulate code. The exceptions are the Gen, 
Num, and Counter classes, which have both state and non- 
trivial behavior. 


6. RELATED WORK 


We discuss related work in the areas of software model 
extraction, pointer analysis, and ownership types. 


6.1 Model Extraction 


Software models play a key role in most software devel- 
opment processes [26, 11]. Modeling is usually carried out 
during the design phase as a way of exploring and specify- 
ing the design. The design is then usually implemented by 
hand, opening up the possibility of inconsistencies between 
the design and the implementation. The software engineer- 
ing community has long recognized the need for tools to help 
ensure that the software conforms to its design [16]. Auto- 
matic model extraction is a particularly appealing alterna- 
tive, because it holds out the promise of delivering models 
that are guaranteed to correctly reflect the structure of the 
implementation. 


6.1.1 Control-Flow Interactions 


Most previous model extraction systems have focused on 
control-flow interactions. The software reflexion system, for 
example, automatically extracts an abstraction of the call 
graph and enables the developer to compare this abstraction 
with a high-level module dependency diagram [19]. Arch- 
Java augments Java with the concepts of components and 
ports. It enforces the constraint that all inter-component 
control transfers must take place through ports [1]. 

Our use of polymorphic token types and the associated 
analysis enables us to capture a wider range of design issues: 
specifically, structural issues associated with referencing 
relationships between objects in the heap and information 
flow issues associated with method invocations. Most 
importantly, we also capture indirect information flow between 
subsystems that takes place via objects in the heap. 
To the best of our knowledge, no previous system attempts 
to perform the analysis that would enable it to 
capture these kinds of dependences. This raises the possibility 
that the models those systems extract fail to accurately 
capture all important interactions. 


6.1.2 Object Model Extraction 


Standard approaches for extracting object models from 
code treat each class as a unit. In type-safe languages, it is 
even possible to extract a (relatively crude) object model di- 
rectly from the type declarations of the fields in the objects. 
Problems with this approach include conflation of different 
instances of general-purpose classes and overly detailed ob- 
ject models because of a failure to abstract internal data 
structures. Womble [17] attacks the latter failure by treat- 
ing collection classes separately as relations between objects. 
Womble is also unsound in that the extracted model may fail 
to accurately characterize the referencing relationships. In 
contrast, our extracted object models are sound and avoid 
both conflation of instances of general-purpose classes and 
excessive detail associated with failing to abstract internal 
data structures. 


6.2 Pointer Analysis 


Pointer analysis has been an active area of research for 
well over 15 years. Approaches range from efficient flow- 
and context-insensitive approaches [3, 25, 24, 21, 14, 10, 
15] to potentially more precise but less efficient flow- and 
context-sensitive approaches [22, 27, 13, 7, 18, 23]. These 
approaches vary in whether they create a result for each 
program point (flow-sensitive analyses) or one result for the 
entire program (flow-insensitive analyses). They also vary 
in whether they produce a result for each calling context 
(context-sensitive analyses) or one result that is valid for 
all calling contexts (context-insensitive analyses). There are 
also flow-insensitive but context-sensitive analyses that pro- 
duce a single parameterized result for each procedure that 
can be specialized for each different calling context [20]. 

From our perspective, a primary difference between ex- 
isting pointer analysis algorithms and our approach is the 
flexibility our approach offers in selecting object represen- 
tatives. Specifically, our polymorphic type system enables 
the developer to separate objects allocated at the same ob- 
ject creation site in the generated model. We believe this 
separation is crucial to delivering models that accurately re- 
flect the conceptual purposes of the different objects in the 
computation. Of course, obtaining this additional precision 
requires the developer to provide the polymorphic type dec- 
larations. 

Another difference is that because the type declarations 
in our programs characterize the points-to relations in the 
reachable region of the heap, there is no need to analyze the 
individual store and load instructions to synthesize a points- 
to graph. Instead, the analysis can simply propagate tokens 
to substitute token variables out of the polymorphic types. 
The analysis needs to process the load and store instructions 
only to generate the heap interaction graph. 

Our approach is quite flexible in the degree of context- 
sensitivity that it provides. It is possible to tune the anal- 
ysis to produce a separate result for each combination of 
token variable and subsystem values, a result that separates 
subsystems but combines information within a single sub- 
system, or a single result for each method. 


6.3 Ownership Types 


Ownership type systems are designed to enforce object 
encapsulation properties [9, 6, 5, 8, 2]. In this capacity, 
they can be used to ensure that objects from one instance of 
an abstraction are not used to inappropriately communicate 
with other instances of the same abstraction [4, 2]. For 
example, one might use ownership types in a multithreaded 
web server to ensure that the sockets associated with one 
server thread do not escape to be used by another server 
thread. 

Our system focuses on extracting communication patterns. 
Encapsulation violations in our system therefore show up as 
unexpected communication. We would attack the problem 
of verifying encapsulation properties by enabling the devel- 
oper to state desired properties, then checking the appro- 
priate extracted model to verify that the program did not 
violate these properties. 


7. CONCLUSION 


The software engineering community has long recognized 
the need for tools to help ensure that the software conforms 
to its design. Our implemented system, with its polymor- 
phic type system, analysis, and automatic model extractors, 
takes an important step towards this goal of verified de- 
sign. Our models capture important information about the 
program; because they are automatically generated, they 
are guaranteed to accurately reflect the program’s structure 
and behavior. The sound heap aliasing information provided 
by our combined type system and analysis enables the ex- 
traction of both structural object referencing models and 
behavioral models that characterize not only direct inter- 
actions that take place at method and procedure calls, but 
also indirect interactions mediated by objects in the heap. 

We believe our approach holds out the promise of inte- 
grating the design effectively into the entire lifecycle of the 
software; today, in contrast, design models tend to become 
increasingly less reliable (and therefore less relevant) as de- 
velopment proceeds into the implementation and mainte- 
nance phases. The potential result would be a more power- 
ful and pervasive notion of design, leading to more reliable 
systems and more economical development. 